Collaboratively Constructed Linguistic Resources for Language Variants and their Exploitation in NLP Application - the case of Tunisian Arabic and the Social Media

Authors

  • Fatiha Sadat
  • Fatma Mallek
  • Mohamed Mahdi Boudabous
  • Rahma Sellami
  • Atefeh Farzindar
Abstract

Modern Standard Arabic (MSA) is the formal language in most Arab countries. Arabic dialects (AD), the languages of daily use, differ from MSA, especially in social media communication. However, most Arabic social media texts mix forms and show many variations, especially between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for translating social media texts. More precisely, it focuses on the Tunisian dialect of Arabic (TAD), with an application to the automatic machine translation of social media text into MSA and any other target language. Linguistic tools such as a bilingual TAD-MSA lexicon and a set of grammatical mapping rules are collaboratively constructed and exploited, together with a language model, to produce MSA sentences from Tunisian dialectal sentences. This work is a first step towards collaboratively constructed semantic and lexical resources for Arabic social media within the ASMAT (Arabic Social Media Analysis Tools) project.
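The combination of a bilingual lexicon with a language model described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon entries, the bigram counts, the scoring function and the names `LEXICON`, `BIGRAM_COUNTS`, `candidates`, `score` and `translate` are all hypothetical, the tokens are placeholders rather than real TAD or MSA words, and the grammatical mapping rules are left out entirely.

```python
import math
from itertools import product

# Hypothetical toy lexicon: each dialect token maps to one or more MSA candidates.
# The tokens are placeholders, not real TAD or MSA vocabulary.
LEXICON = {
    "d_very": ["m_very", "m_much"],
    "d_good": ["m_good"],
}

# Hypothetical bigram counts standing in for an MSA language model.
BIGRAM_COUNTS = {
    ("<s>", "m_film"): 4,
    ("m_film", "m_good"): 3,
    ("m_good", "m_very"): 5,
    ("m_good", "m_much"): 1,
    ("m_very", "</s>"): 4,
    ("m_much", "</s>"): 2,
}


def candidates(tokens):
    """Yield every candidate sentence obtained by lexicon substitution.

    Tokens missing from the lexicon are passed through unchanged.
    """
    options = [LEXICON.get(tok, [tok]) for tok in tokens]
    for combo in product(*options):
        yield list(combo)


def score(sentence):
    """Toy fluency score: sum of log(count + 1) over the sentence's bigrams."""
    padded = ["<s>"] + sentence + ["</s>"]
    return sum(
        math.log(BIGRAM_COUNTS.get(pair, 0) + 1)
        for pair in zip(padded, padded[1:])
    )


def translate(tokens):
    """Return the candidate sentence that the toy language model scores highest."""
    return max(candidates(tokens), key=score)


print(translate(["m_film", "d_good", "d_very"]))
```

The lexicon proposes alternatives and the language model chooses among them. A real system would need normalized probabilities, proper smoothing and a search that does not enumerate every combination.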

Similar resources

A New Method for Extracting Named Entities in Classical Arabic

In Natural Language Processing (NLP) studies, developing resources and tools contributes to the extension and effectiveness of research in each language. In recent years, Arabic Named Entity Recognition (ANER) has attracted the attention of NLP researchers because of its significant impact on improving other NLP tasks such as machine translation, information retrieval, question answering, query result...

Spoken Tunisian Arabic Corpus "STAC": Transcription and Annotation

Corpora are considered an important resource for natural language processing (NLP). Currently, Dialectal Arabic corpora are somewhat limited, particularly in the case of Tunisian Arabic. In recent years, since the events of the revolution, the increasing presence of spoken Tunisian Arabic in interviews, news and debate programs, the increasing use of language technologies for many sp...

They Want To Eradicate the Nation: A Cross-Linguistic Study of the Attitudinal Language of Presidential Campaign Speeches in the USA and Iran

Politicians adopt a variety of linguistic strategies in their speeches to connect with their audience. To name one, appraisal, as a system of interpersonal meaning, is concerned with evaluation where resources are used for negotiating social relationships. Despite their significance in shaping texts, there have hardly been any extensive inventories of appraisal tools contrasting electoral speec...

Simplified guidelines for the creation of Large Scale Dialectal Arabic Annotations

The Arabic language is a collection of dialectal variants along with the standard form, Modern Standard Arabic (MSA). MSA is used in official settings, while the dialectal variants (DA) correspond to the native tongue of Arabic speakers. Arabic speakers typically code-switch between DA and MSA, which is reflected extensively in written online social media. Automatic processing such Arabic ge...

Abstracts / Journal of the Arabic Language and Literature, Vol. 14, No. 48, Autumn 2018

Contents: The Representation of Culture in Arabic pedagogy books to non-Arabic languages (Danesh Mohammadi, Sakineh Zarenejad); Critical Study of the manifestations of Mamluke's life from the novel “Alsaeroun niyam...

Publication date: 2014